Floating-point numbers

Floating-point number representation

For example, 312110000000000000000000 can be written as $3.1211 \times 10^{23}$ using scientific notation.
If we adopt this system in binary, we get:

$M \times 2^{E}$

M is the mantissa and E is the exponent.
This is known as binary floating-point representation.
In our examples, we will assume a computer uses 8 bits to represent the mantissa and 8 bits to store the exponent (a binary point is assumed to exist between the first and second bits of the mantissa).
Again, using denary as our example, in a number such as $0.31211 \times 10^{24}$ the mantissa is 0.31211 and the exponent is 24.

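We thus get the binary floating-point equivalent, using 8 bits for the mantissa and 8 bits for the exponent, with the assumed binary point between the −1 and 1/2 place values in the mantissa:

Mantissa place values: −1 . 1/2 1/4 1/8 1/16 1/32 1/64 1/128
Exponent place values: −128 64 32 16 8 4 2 1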

In binary, we use $M \times 2^{E}$ to represent floating-point numbers: M is the ____ and E is the ____.

Convert this binary floating-point number into denary

For example: 0101 1010 0000 0100

Method 1

M = 1/2 + 1/8 + 1/16 + 1/64 = 45/64
E = 4
$M \times 2^{E}$
= 45/64 × 2⁴
= 0.703125 × 16
= 11.25

Method 2

M = 0.1011010
E = 4

Shift the binary point 4 places to the right: 01011.010
= 8 + 2 + 1 + 1/4
= 11.25
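
Method 1 can also be sketched in Python. This is a minimal illustration, not part of the original notes: it assumes both the mantissa and the exponent are stored in two's complement, and the function names `twos_complement` and `decode` are made up for this example.

```python
def twos_complement(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":  # a leading 1 means the number is negative
        value -= 1 << len(bits)
    return value


def decode(word: str, mantissa_bits: int = 8) -> float:
    """Return the denary value of a mantissa + exponent bit pattern."""
    word = word.replace(" ", "")
    mantissa = twos_complement(word[:mantissa_bits])
    exponent = twos_complement(word[mantissa_bits:])
    # The binary point sits after the first mantissa bit, so the
    # mantissa integer is scaled down by 2^(mantissa_bits - 1).
    return mantissa / 2 ** (mantissa_bits - 1) * 2.0 ** exponent


print(decode("0101 1010 0000 0100"))  # 11.25
```

Here the mantissa 01011010 is read as the integer 90 and divided by 128 to give 45/64, which is then multiplied by 2⁴.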

Convert this binary floating-point number into denary: 0010 1000 0000 0011

Converting denary numbers into binary floating-point numbers

Convert +4.5 into a binary floating-point number.

Method 1

4.5 = 9/2 = 9/16 × 2³
M = 9/16 = 1/2 + 1/16
E = 3

M = 01001000
E = 00000011

Ans: 01001000 00000011

Method 2

4 = 0100 and .5 = .1, which gives: 0100.1
0100.1 = 0.1001 × 2³ (moving the binary point three places to the left, so the exponent is 3, which is 11 in binary)

Ans: 01001000 00000011
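
The shifting idea in Method 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `encode` is hypothetical, it handles positive values only, it drops any bits that do not fit in the mantissa, and it does not check that the exponent fits in 8 bits.

```python
from fractions import Fraction


def encode(value, mantissa_bits=8, exponent_bits=8):
    """Encode a positive denary value as 'mantissa exponent' bit strings."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    m = Fraction(value)
    exponent = 0
    # Move the binary point until 1/2 <= m < 1, so the mantissa starts 0.1
    while m >= 1:
        m /= 2
        exponent += 1
    while m < Fraction(1, 2):
        m *= 2
        exponent -= 1
    # Keep only the bits that fit after the binary point.
    mantissa = int(m * 2 ** (mantissa_bits - 1))
    # Masking gives the two's complement pattern for a negative exponent.
    exponent &= (1 << exponent_bits) - 1
    return f"{mantissa:0{mantissa_bits}b} {exponent:0{exponent_bits}b}"


print(encode(4.5))  # 01001000 00000011
```

`Fraction` is used instead of floats so that the repeated halving and doubling stays exact.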

Convert this denary number into a binary floating-point number: 0.171875

Normalisation

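Without normalisation, the same number can be stored in several ways. For example, each of the following represents 2:

0.1000000 00000010 = 1/2 × 2² = 2
0.0100000 00000011 = 1/4 × 2³ = 2
0.0010000 00000100 = 1/8 × 2⁴ = 2
0.0001000 00000101 = 1/16 × 2⁵ = 2

Normalisation allows only one of these representations.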

With this method, for a positive number, the mantissa must start with 0.1 (as in our first representation of 2 above).
The bits in the mantissa are shifted to the left until we arrive at 0.1; for each shift left, the exponent is reduced by 1.
Look at the examples above to see how this works: starting with 0.0001000, we shift the bits 3 places to the left to get 0.1000000 and reduce the exponent by 3 to give 00000010, so we end up with the first representation.
For a negative number the mantissa must start with 1.0.
The bits in the mantissa are shifted until we arrive at 1.0; again, the exponent must be changed to reflect the number of shifts.
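
The shifting rule can be sketched in Python. This is a minimal illustration: the function name `normalise` is made up, the mantissa is passed as a bit string without the binary point, and the exponent is passed as an ordinary integer rather than a bit pattern.

```python
def normalise(mantissa: str, exponent: int) -> tuple[str, int]:
    """Shift the mantissa left until it is normalised, adjusting the exponent."""
    if "1" not in mantissa:
        raise ValueError("zero has no normalised form")
    # A normalised mantissa starts with 01 (positive) or 10 (negative),
    # so keep shifting while the first two bits are the same.
    while mantissa[0] == mantissa[1]:
        mantissa = mantissa[1:] + "0"  # shift left by one place
        exponent -= 1                  # reduce the exponent to compensate
    return mantissa, exponent


print(normalise("00001000", 5))  # ('01000000', 2)
```

The example call starts from 0.0001000 with exponent 5 and ends at 0.1000000 with exponent 2, matching the three shifts described above.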

With this method, for a positive number, the mantissa must start with ____; for a negative number the mantissa must start with ____.

Potential rounding errors and approximations

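Consider storing 5.88. The fractional part, .88, is converted to binary by repeatedly multiplying by 2:

.88 × 2 = 1.76 so we will use the 1 value to give 0.1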
.76 × 2 = 1.52 so we will use the 1 value to give 0.11
.52 × 2 = 1.04 so we will use the 1 value to give 0.111
.04 × 2 = 0.08 so we will use the 0 value to give 0.1110
.08 × 2 = 0.16 so we will use the 0 value to give 0.11100
.16 × 2 = 0.32 so we will use the 0 value to give 0.111000
.32 × 2 = 0.64 so we will use the 0 value to give 0.1110000
.64 × 2 = 1.28 so we will use the 1 value to give 0.11100001
5.88 ≈ 0101.11100001 = 0.10111100001 × 2³
Only 8 bits are available for the mantissa, so this is stored as 0.1011110 00000011 = 47/64 × 2³ = 47/8 = 5.875
So, 5.88 is stored as 5.875 in our floating-point system.
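
The repeated multiplication by 2 can be sketched in Python. This is a minimal illustration with a hypothetical function name, `fraction_to_bits`. It uses `Fraction` so that .88 is held exactly, rather than as a Python float that is itself only an approximation.

```python
from fractions import Fraction


def fraction_to_bits(fraction: Fraction, places: int) -> str:
    """Convert a fraction in the range 0 to 1 to binary by repeated doubling."""
    bits = ""
    for _ in range(places):
        fraction *= 2
        if fraction >= 1:  # the whole-number part becomes the next bit
            bits += "1"
            fraction -= 1
        else:
            bits += "0"
    return bits


print(fraction_to_bits(Fraction(88, 100), 8))  # 11100001
```

After 8 places there is still a remainder, so .88 cannot be represented exactly in this number of bits.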

Precision vs Range

The accuracy of a number can be increased by increasing the number of bits used in the mantissa.
The range of numbers can be increased by increasing the number of bits used in the exponent.
Accuracy and range will always be a trade-off between mantissa and exponent size.

TIP

The mantissa is 12 bits and the exponent is 4 bits. This gives a largest positive value of (2047/2048) × 2⁷, which gives high accuracy but small range.

TIP

The mantissa is 4 bits and the exponent is 12 bits. This gives a largest possible value of (7/8) × 2²⁰⁴⁷, which gives poor accuracy but extremely high range.
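
The largest positive values quoted in the two tips can be checked with a short Python sketch. It assumes two's complement for both fields, so the largest mantissa is 0.111...1 and the largest exponent is 0111...1. The function name `largest_positive` is made up for this example.

```python
from fractions import Fraction


def largest_positive(mantissa_bits: int, exponent_bits: int):
    """Return (largest mantissa, largest exponent) for the given field sizes."""
    scale = 2 ** (mantissa_bits - 1)
    mantissa = Fraction(scale - 1, scale)    # bit pattern 0.111...1
    exponent = 2 ** (exponent_bits - 1) - 1  # bit pattern 0111...1
    return mantissa, exponent


for m_bits, e_bits in [(12, 4), (8, 8), (4, 12)]:
    mantissa, exponent = largest_positive(m_bits, e_bits)
    print(f"{m_bits}-bit mantissa, {e_bits}-bit exponent: {mantissa} x 2^{exponent}")
```

For the same 16 bits in total, moving bits from the mantissa to the exponent increases the range and reduces the accuracy.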

The accuracy of a number can be increased by increasing the number of bits used in the ____; the range of numbers can be increased by increasing the number of bits used in the ____.

Overflow and underflow

There are additional problems:
- If a calculation produces a number which exceeds the maximum possible value that can be stored in the mantissa and exponent, an overflow error will be produced. This could occur when trying to divide by a very small number or even 0.
- Dividing by a very large number can produce a result which is smaller than the smallest number that can be stored. This would lead to an underflow error.
- One of the issues of using normalised binary floating-point numbers is the inability to store the number zero. This is because the mantissa must start with 0.1 or 1.0, which does not allow for a zero value.
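
The overflow and underflow conditions for positive results can be sketched in Python. This is an illustration only. It assumes the 8-bit mantissa and 8-bit exponent format used in this section, with a two's complement exponent ranging from −128 to 127. The names `LARGEST`, `SMALLEST` and `check` are made up for this example.

```python
from fractions import Fraction

# Limits for positive normalised numbers, assuming an 8-bit mantissa and an
# 8-bit two's complement exponent (range -128 to 127).
LARGEST = Fraction(127, 128) * 2 ** 127  # 0.1111111 x 2^127
SMALLEST = Fraction(1, 2) / 2 ** 128     # 0.1000000 x 2^-128


def check(value: Fraction) -> str:
    """Classify a positive result as storable, overflow or underflow."""
    if value > LARGEST:
        return "overflow"
    if value < SMALLEST:
        return "underflow"
    return "ok"


print(check(LARGEST * 2))      # overflow
print(check(SMALLEST / 1000))  # underflow
print(check(Fraction(9, 2)))   # ok
```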

If a calculation produces a number which exceeds the maximum possible value that can be stored in the mantissa and exponent, an ____ error will be produced. Dividing by a very large number can lead to a result which is less than the smallest number that can be stored. This would lead to an ____ error.
